{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0c6c50fc-15e1-4767-925a-53a37c430b9b",
   "metadata": {},
   "source": [
    "# How to load HTML\n",
    "\n",
    "The HyperText Markup Language or [HTML](https://en.wikipedia.org/wiki/HTML) is the standard markup language for documents designed to be displayed in a web browser.\n",
    "\n",
    "This covers how to load `HTML` documents into a LangChain [Document](https://v02.api.js.langchain.com/classes/langchain_core_documents.Document.html) objects that we can use downstream.\n",
    "\n",
    "Parsing HTML files often requires specialized tools. Here we demonstrate parsing via [Unstructured](https://unstructured-io.github.io/unstructured/). Head over to the integrations page to find integrations with additional services, such as [FireCrawl](/docs/integrations/document_loaders/web_loaders/firecrawl).\n",
    "\n",
    ":::info Prerequisites\n",
    "\n",
    "This guide assumes familiarity with the following concepts:\n",
    "\n",
    "- [Documents](/docs/concepts#document)\n",
    "- [Document Loaders](/docs/concepts#document-loaders)\n",
    "\n",
    ":::\n",
    "\n",
    "## Installation\n",
    "\n",
    "```{=mdx}\n",
    "import Npm2Yarn from \"@theme/Npm2Yarn\"\n",
    "\n",
    "<Npm2Yarn>\n",
    "  @langchain/community\n",
    "</Npm2Yarn>\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "868cfb85",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Although Unstructured has an open source offering, you're still required to provide an API key to access the service. To get everything up and running, follow these two steps:\n",
    "\n",
    "1. Download & start the Docker container:\n",
    "  \n",
    "```bash\n",
    "docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0\n",
    "```\n",
    "\n",
    "2. Get a free API key & API URL [here](https://unstructured.io/api-key), and set it in your environment (as per the Unstructured website, it may take up to an hour to allocate your API key & URL.):\n",
    "\n",
    "```bash\n",
    "export UNSTRUCTURED_API_KEY=\"...\"\n",
    "# Replace with your `Full URL` from the email\n",
    "export UNSTRUCTURED_API_URL=\"https://<ORG_NAME>-<SECRET>.api.unstructuredapp.io/general/v0/general\" \n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4d93b2e",
   "metadata": {},
   "source": [
    "## Loading HTML with Unstructured"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "7d167ca3-c7c7-4ef0-b509-080629f0f482",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[\n",
      "  Document {\n",
      "    pageContent: 'Word of the Day',\n",
      "    metadata: {\n",
      "      category_depth: 0,\n",
      "      languages: [Array],\n",
      "      filename: 'wordoftheday.html',\n",
      "      filetype: 'text/html',\n",
      "      category: 'Title'\n",
      "    }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: ': April 10, 2023',\n",
      "    metadata: {\n",
      "      emphasized_text_contents: [Array],\n",
      "      emphasized_text_tags: [Array],\n",
      "      languages: [Array],\n",
      "      parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',\n",
      "      filename: 'wordoftheday.html',\n",
      "      filetype: 'text/html',\n",
      "      category: 'UncategorizedText'\n",
      "    }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: 'foible',\n",
      "    metadata: {\n",
      "      category_depth: 1,\n",
      "      languages: [Array],\n",
      "      parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',\n",
      "      filename: 'wordoftheday.html',\n",
      "      filetype: 'text/html',\n",
      "      category: 'Title'\n",
      "    }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: 'play',\n",
      "    metadata: {\n",
      "      category_depth: 0,\n",
      "      link_texts: [Array],\n",
      "      link_urls: [Array],\n",
      "      link_start_indexes: [Array],\n",
      "      languages: [Array],\n",
      "      filename: 'wordoftheday.html',\n",
      "      filetype: 'text/html',\n",
      "      category: 'Title'\n",
      "    }\n",
      "  },\n",
      "  Document {\n",
      "    pageContent: 'noun',\n",
      "    metadata: {\n",
      "      category_depth: 0,\n",
      "      emphasized_text_contents: [Array],\n",
      "      emphasized_text_tags: [Array],\n",
      "      languages: [Array],\n",
      "      filename: 'wordoftheday.html',\n",
      "      filetype: 'text/html',\n",
      "      category: 'Title'\n",
      "    }\n",
      "  }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "import { UnstructuredLoader } from \"@langchain/community/document_loaders/fs/unstructured\";\n",
    "\n",
    "const filePath = \"../../../../libs/langchain-community/src/tools/fixtures/wordoftheday.html\"\n",
    "\n",
    "const loader = new UnstructuredLoader(filePath, {\n",
    "  apiKey: process.env.UNSTRUCTURED_API_KEY,\n",
    "  apiUrl: process.env.UNSTRUCTURED_API_URL,\n",
    "});\n",
    "\n",
    "const data = await loader.load()\n",
    "console.log(data.slice(0, 5));"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "TypeScript",
   "language": "typescript",
   "name": "tslab"
  },
  "language_info": {
   "codemirror_mode": {
    "mode": "typescript",
    "name": "javascript",
    "typescript": true
   },
   "file_extension": ".ts",
   "mimetype": "text/typescript",
   "name": "typescript",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
